AITopics | text categorization

This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
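The metrics named in this abstract can be illustrated with a small sketch for a single sample. The abstract does not define the hallucination or inflation ratios, so the definitions below are assumptions: the hallucination ratio is taken as the share of predicted labels that do not exist in the taxonomy, and the inflation ratio as the number of predicted labels divided by the number of human-assigned labels. The function name `score_sample` and the category names are hypothetical and are not taken from the IAB taxonomy or the paper.

```python
def score_sample(predicted, gold, taxonomy):
    """Score one sample's predicted label set against its human labels."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # ASSUMPTION: a "hallucinated" label is one that is not in the taxonomy.
    hallucinated = predicted - set(taxonomy)
    hallucination_ratio = len(hallucinated) / len(predicted) if predicted else 0.0

    # ASSUMPTION: inflation compares label counts, model versus human.
    inflation_ratio = len(predicted) / len(gold) if gold else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination_ratio": hallucination_ratio,
        "inflation_ratio": inflation_ratio,
    }


# Hypothetical taxonomy and labels, for illustration only.
taxonomy = {"Sports", "Soccer", "Travel", "Food & Drink"}
result = score_sample(
    predicted=["Sports", "Soccer", "Football Fans"],
    gold=["Sports", "Soccer"],
    taxonomy=taxonomy,
)
print(result)
```

Here the model returns one more label than the annotator, and that extra label is outside the taxonomy, so under these assumed definitions both the inflation and hallucination ratios are nonzero even though recall is perfect.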

category, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.13885

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

Rie Johnson, Tong Zhang

Neural Information Processing Systems, Oct-2-2025, 12:24:11 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.83)

Performance Analysis of Supervised Machine Learning Algorithms for Text Classification

Mishu, Sadia Zaman; Rafiuddin, S M

arXiv.org Artificial Intelligence, Sep-3-2025

The demand for text classification is growing significantly in web search, data mining, web ranking, recommendation systems, and many other fields of information and technology. This paper illustrates the text classification process on different datasets using several standard supervised machine learning techniques. Text documents can be classified with various kinds of classifiers, and supervised classification relies on labeled text documents. This paper applies these classifiers to different kinds of labeled documents and measures their accuracy. An Artificial Neural Network (ANN) model using a Back Propagation Network (BPN) is used alongside several other models to create an independent platform for labeled, supervised text classification. An existing benchmark approach is used to analyze classification performance on labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.

machine learning, natural language, text classification, (13 more...)

arXiv.org Artificial Intelligence

2509.00983

Genre: Research Report (0.66)

Industry: Education (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.74)
(2 more...)

Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text

Jarca, Andrei; Croitoru, Florinel Alin; Ionescu, Radu Tudor

arXiv.org Artificial Intelligence, Feb-18-2025

Masked language modeling has become a widely adopted unsupervised technique to pre-train language models. However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
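One possible reading of a cyclic decaying masking ratio can be sketched as follows. The abstract does not give the formula, so everything here is an assumption: the ratio is taken to decay linearly from a high value (hard) to a low value (easy) within each cycle and then reset. The function name `masking_ratio`, the cycle length, and the values 0.30 and 0.15 are hypothetical and are not taken from the paper or its repository.

```python
def masking_ratio(step, cycle_length, high=0.30, low=0.15):
    """Hypothetical schedule: decay linearly from `high` to `low` in each cycle."""
    position = step % cycle_length  # where we are inside the current cycle
    # Fraction of the cycle completed: 0.0 at the start, 1.0 at the last step.
    fraction = position / (cycle_length - 1) if cycle_length > 1 else 0.0
    return high - (high - low) * fraction


# Two cycles of four steps each: every cycle starts hard and ends easy.
schedule = [round(masking_ratio(step, cycle_length=4), 2) for step in range(8)]
print(schedule)
```

Starting each cycle at the higher ratio matches the hard-to-easy ordering the abstract describes as an anti-curriculum. The task-informed choice of which tokens to mask is not covered by this sketch.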

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.12953

Country: Europe > Romania (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.50)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

Neural Information Processing Systems, Jan-14-2025, 09:02:38 GMT

This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. Unlike the previous approaches that rely on word embeddings, our method learns embeddings of small text regions from unlabeled data for integration into a supervised CNN. The proposed scheme for embedding learning is based on the idea of two-view semi-supervised learning, which is intended to be useful for the task of interest even though the training is done on unlabeled data. Our models achieve better results than previous approaches on sentiment classification and topic classification tasks.

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

Neural Information Processing Systems, Mar-13-2024, 02:30:31 GMT

This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. Unlike the previous approaches that rely on word embeddings, our method learns embeddings of small text regions from unlabeled data for integration into a supervised CNN. The proposed scheme for embedding learning is based on the idea of two-view semi-supervised learning, which is intended to be useful for the task of interest even though the training is done on unlabeled data. Our models achieve better results than previous approaches on sentiment classification and topic classification tasks.

convolution layer, unlabeled data, vector, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

🇺🇸 Machine learning job: Senior Machine Learning Engineer (Remote) at Team Go (work from anywhere in US!)

#artificialintelligence, Aug-9-2021, 18:46:09 GMT

You will primarily work on our text categorization and scoring models using tools like spaCy, Spark NLP, Textacy, Gensim, scikit-learn, and TensorFlow. In this role, you'll work with a modern data stack and a serverless streaming data architecture. Our stack can be described as a collection of microservices using tools such as AWS Lambda, Kinesis Firehose, AWS S3, AWS Glue, Amazon Athena, API Gateway, SageMaker, Mode Analytics, and Spark [Databricks].

About You

You have a BS or higher in Computer Science, Mathematics, Statistics, Economics, or another quantitative field. You have at least two years of experience working on applied machine learning systems in production cloud environments (AWS, Google Cloud, etc.). You have experience along the entire machine learning product lifecycle, from initial data ingest and data prep, through modeling and creating REST API endpoints or managing batch inference workloads, to monitoring model performance and evaluating drift. You're technically competent with the Python data science ecosystem (pandas, NumPy, SciPy, scikit-learn, Jupyter); Apache Spark and associated frameworks (Spark NLP, Spark Streaming, Spark MLlib); and TensorFlow/Keras.

iterate, machine learning engineer, senior machine learning engineer, (11 more...)

#artificialintelligence

Country:

Europe (0.06)
North America > United States > California > Santa Clara County > Cupertino (0.05)
North America > United States > Oregon (0.05)
Asia > Singapore (0.04)

Industry: Information Technology (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

Johnson, Rie; Zhang, Tong

Neural Information Processing Systems, Feb-14-2020, 07:44:14 GMT

This paper presents a new semi-supervised framework with convolutional neural networks (CNNs) for text categorization. Unlike the previous approaches that rely on word embeddings, our method learns embeddings of small text regions from unlabeled data for integration into a supervised CNN. The proposed scheme for embedding learning is based on the idea of two-view semi-supervised learning, which is intended to be useful for the task of interest even though the training is done on unlabeled data. Our models achieve better results than previous approaches on sentiment classification and topic classification tasks.

Rep the Set: Neural Networks for Learning Set Representations

Skianis, Konstantinos; Nikolentzos, Giannis; Limnios, Stratis; Vazirgiannis, Michalis

arXiv.org Machine Learning, Apr-3-2019

In several domains, data objects can be decomposed into sets of simpler objects. It is then natural to represent each object as the set of its components or parts. Many conventional machine learning algorithms are unable to process this kind of representation, since sets may vary in cardinality and their elements lack a meaningful ordering. In this paper, we present a new neural network architecture, called RepSet, that can handle examples that are represented as sets of vectors. The proposed model computes the correspondences between an input set and some hidden sets by solving a series of network flow problems. This representation is then fed to a standard neural network architecture to produce the output. The architecture allows end-to-end gradient-based learning. We demonstrate RepSet on classification tasks, including text categorization and graph classification, and we show that the proposed neural network achieves performance better than or comparable to state-of-the-art algorithms.

artificial intelligence, dataset, machine learning, (16 more...)

arXiv.org Machine Learning

1904.01962

Country: Europe (0.46)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Sports > Soccer (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications

Berge, Geir Thore; Granmo, Ole-Christoffer; Tveit, Tor Oddbjørn; Goodwin, Morten; Jiao, Lei; Matheussen, Bernt Viggo

arXiv.org Machine Learning, Sep-12-2018

Medical applications challenge today's text categorization techniques by demanding both high accuracy and ease-of-interpretation. Although deep learning has provided a leap ahead in accuracy, this leap comes at the sacrifice of interpretability. To address this accuracy-interpretability challenge, we here introduce, for the first time, a text categorization approach that leverages the recently introduced Tsetlin Machine. In all brevity, we represent the terms of a text as propositional variables. From these, we capture categories using simple propositional formulae, such as: if "rash" and "reaction" and "penicillin" then Allergy. The Tsetlin Machine learns these formulae from a labelled text, utilizing conjunctive clauses to represent the particular facets of each category. Indeed, even the absence of terms (negated features) can be used for categorization purposes. Our empirical results are quite conclusive. The Tsetlin Machine either performs on par with or outperforms all of the evaluated methods on both the 20 Newsgroups and IMDb datasets, as well as on a non-public clinical dataset. On average, the Tsetlin Machine delivers the best recall and precision scores across the datasets. The GPU implementation of the Tsetlin Machine is further 8 times faster than the GPU implementation of the neural network. We thus believe that our novel approach can have a significant impact on a wide range of text analysis applications, forming a promising starting point for deeper natural language understanding with the Tsetlin Machine.
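The kind of propositional rule quoted in this abstract can be sketched as a conjunctive clause over term presence. This is only an illustration of how such a clause is evaluated, not the Tsetlin Machine's learning procedure, which the abstract does not detail. The function name `clause_matches`, the whitespace tokenization, the example notes, and the negated term "denies" are all assumptions made for this sketch.

```python
def clause_matches(text, required_terms, forbidden_terms=()):
    """Evaluate a conjunctive clause: every required term is present
    and every forbidden (negated) term is absent."""
    # ASSUMPTION: simple lowercase whitespace tokenization.
    terms = set(text.lower().split())
    has_all_required = all(term in terms for term in required_terms)
    has_no_forbidden = not any(term in terms for term in forbidden_terms)
    return has_all_required and has_no_forbidden


# Rule from the abstract: if "rash" and "reaction" and "penicillin" then Allergy.
required = ("rash", "reaction", "penicillin")

# Hypothetical clinical notes, for illustration only.
note_a = "Patient developed a rash after penicillin suggesting an allergic reaction"
note_b = "Patient denies rash or reaction to penicillin"

is_allergy_a = clause_matches(note_a, required)
# Hypothetical negated feature: the clause fires only if "denies" is absent.
is_allergy_b = clause_matches(note_b, required, forbidden_terms=("denies",))
print(is_allergy_a, is_allergy_b)
```

The second call shows why the abstract notes that the absence of a term can matter: both notes contain all three required terms, and only the negated feature separates them.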

machine learning, pattern recognition, tsetlin machine, (20 more...)

arXiv.org Machine Learning

1809.04547

Country:

Europe > Norway > Southern Norway > Agder > Kristiansand (0.04)
Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.48)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.36)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization
